Object detection in videos has drawn increasing attention recently with the introduction of the large-scale ImageNet VID dataset. Unlike in static images, temporal information in videos is vital for object detection. To fully utilize temporal information, state-of-the-art methods are built on spatiotemporal tubelets, which are essentially sequences of associated bounding boxes across time. However, existing methods have major limitations in generating tubelets in terms of quality and efficiency. Motion-based methods obtain dense tubelets efficiently, but their lengths are generally only several frames, which is not optimal for incorporating long-term temporal information. Appearance-based methods, usually involving generic object tracking, can generate long tubelets, but are usually computationally expensive. In this work, we propose a framework for object detection in videos, which consists of a novel tubelet proposal network to efficiently generate spatiotemporal proposals, and a Long Short-Term Memory (LSTM) network that incorporates temporal information from tubelet proposals to achieve high object detection accuracy in videos. Experiments on the large-scale ImageNet VID dataset demonstrate the effectiveness of the proposed framework for object detection in videos.
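To make the second stage concrete, below is a minimal sketch of an LSTM that classifies each frame of a tubelet proposal from per-frame appearance features, assuming those features have already been extracted (e.g., ROI-pooled from a CNN backbone) along each tubelet. The module and parameter names (TubeletClassifier, feat_dim, hidden_dim) are illustrative assumptions, not the paper's implementation, and PyTorch is used only for convenience.

```python
import torch
import torch.nn as nn

class TubeletClassifier(nn.Module):
    """Hypothetical sketch: per-frame classification along a tubelet with an LSTM."""

    def __init__(self, feat_dim=512, hidden_dim=256, num_classes=31):
        super().__init__()
        # The LSTM runs along the temporal axis of the tubelet, so each
        # frame's prediction can draw on earlier frames in the sequence.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Per-frame scores; 31 = 30 ImageNet VID classes + background (assumption).
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, tubelet_feats):
        # tubelet_feats: (batch, T, feat_dim), one feature vector per
        # bounding box in the tubelet.
        hidden, _ = self.lstm(tubelet_feats)   # (batch, T, hidden_dim)
        return self.fc(hidden)                 # (batch, T, num_classes)

# Usage: score a batch of 4 tubelets spanning 20 frames each.
feats = torch.randn(4, 20, 512)
scores = TubeletClassifier()(feats)            # shape (4, 20, 31)
```

The design choice this illustrates is the one the abstract argues for: because the recurrent network aggregates evidence across the whole tubelet, longer proposals let each frame's classification benefit from more temporal context than motion-based tubelets of only several frames could provide.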